11 research outputs found

    Document classification based on library catalogue metadata

    The metadata catalogues of national libraries are good sources of information, because they record nearly all material published in a given period and region. They are usually comprehensively described, so they can serve as sources for quantitative research. When conducting research, it is often worthwhile to divide the research material into smaller parts, for example by genre. In many cases, however, gaps in the data reduce its usability. This master's thesis evaluates the possibility of using machine learning to find research-relevant subsets in library catalogues. As the example case I chose the English Short Title Catalogue (ESTC), with poetry books as the subset to be found. Genre information for poetry books should be annotated, but in real library catalogues it is often missing. To obtain the best result, I used the random forest algorithm with different types of feature vectors traditionally used in authorship attribution and genre classification, together with the values of metadata fields. Because library catalogues do not contain the full text of the books, feature selection focused on the words used in titles and on their linguistic properties. Titles are usually short and carry very little information, so I combined the best-performing features from the feature vectors and ran the final search with them. The main result of the study was confirmation that using titles to construct features is a viable strategy. The study opens up possibilities for defining subsets by means of machine learning in the future and for increasing the use of library catalogues in quantitative research.

    Bibliographic Data Science and the History of the Book (c. 1500–1800)

    National bibliographies have been identified as a crucial resource for historical research on the publishing landscape, but using them requires addressing challenges of data quality, completeness, and interpretation. We call this approach bibliographic data science. In this article, we briefly assess the development of book formats and the vernacularization process in early modern Europe. The work undertaken paves the way for more extensive integration of library catalogs to map the history of the book. Peer reviewed.

    FinnFN 1.0: The Finnish frame semantic database

    The article describes the process of creating a Finnish-language FrameNet, or FinnFN, based on the original English-language FrameNet hosted at the International Computer Science Institute in Berkeley, California. We outline the goals and results of the FinnFN project, especially the creation of the FinnFrame corpus. The main aim of the project was to test the universal applicability of frame semantics by annotating real Finnish text using the same frames and annotation conventions as in the original Berkeley FrameNet project. From Finnish newspaper corpora, 40,721 sentences were automatically retrieved and manually annotated as example sentences evoking certain frames. This became the FinnFrame corpus. Applying the Berkeley FrameNet annotation conventions to Finnish required some modifications due to Finnish morphology, and a convention for annotating individual morphemes within words was introduced for phenomena such as compounding, comparatives and case endings. Various questions about cultural salience across the two languages arose during the project, but problematic situations occurred only in a few examples, which we also discuss in the article. The article shows that, barring a few minor instances, the universality hypothesis of frames is largely confirmed for languages as different as Finnish and English. Peer reviewed.

    A Quantitative Approach to Book-Printing in Sweden and Finland, 1640–1828

    Several cities in Sweden have been providing book-printing facilities since the 1640s. In our quantitative and explorative analysis of library catalogs from the National Library of Sweden and the National Library of Finland we identify the general trends in publishing, how book-printing has been affected by political events, and how printing developed at different paces in different parts of the realm. We have developed a new method for analyzing the totality of publishing through extensive data harmonization and comprehensive statistical analysis, and by treating library catalogs not as an endpoint of bibliographic research but as an inherently rich source of information. This facilitated the quantitative assessment of printing in the Swedish realm based on the metadata contained in library catalogs. Our data-driven approach to the transformation of public discourse demonstrates that whereas the amount of printed material grew steadily, political ruptures affected the development of printing. We also suggest that the culture of books and printing is best understood through the dynamics of competing intellectual hubs consisting of the university cities and the political center in Stockholm. This perspective further challenges the dominant, nationally delineated approach in book history.

    Analytical Edition Detection In Bibliographic Metadata

    The aim of analytical bibliography is to understand books and other printed objects as artifacts, and how they were produced. Bibliographic metadata can represent important historical trends and resolve issues such as the ordering of editions. In this paper, we present a state-of-the-art analytical approach for determining editions and their ordering. By providing harmonized data and information on historical developments in book production, this approach will be a great aid for projects aiming to do large-scale text mining. Contemporary text mining approaches do not utilize edition-level information to the fullest extent and are therefore limited in their scope. Using the ESTC metadata, we have developed harmonizing techniques that convert free-form text into more coherent entries for statistical analysis. Furthermore, a new gold standard with multiple layers of information was developed for validation purposes. The use of this data would significantly enhance the understanding of early modern publishing. Peer reviewed.

    Book Formats and Reading Habits in Early Modern Europe

    No full text
    Abstract and poster of paper 0596 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.